Protein structures and optimal folding emerging from a geometrical variational principle

Novel numerical techniques, validated by an analysis of barnase and chymotrypsin inhibitor, are
used to elucidate the paramount role played by the geometry of the protein backbone in steering 
the folding to the correct native state. It is found that, irrespective of the sequence, the native 
state of a protein has an exceedingly large number of conformations with a given amount of structural
overlap compared to other compact artificial backbones; moreover the conformational entropies of 
unrelated proteins of the same length are nearly equal at any given stage of folding. These results 
are suggestive of an extremality principle underlying protein evolution, which, in turn, is shown to 
be associated with the emergence of secondary structures. 



The rapid and reversible folding of protein-like heteropolymers
into their thermodynamically stable native
state is accompanied by a huge reduction in conformational
entropy. Evidence has been accumulating
for an achievement of the entropy reduction through
a folding funnel which favors the kinetic accessibility of
the native state. Some fundamental questions
remain, however, unanswered. What makes proteins spe- 
cial compared to random heteropolymers? What guides 
the folding of a protein? Is it the sequence that is funda- 
mental or its native structure? 

In this letter, we examine these issues and focus on 
the special role played by the native structure of pro- 
teins, with no input of information regarding amino acid 
sequences. The study is carried out through a novel the- 
oretical probe for the conformation space of proteins: 
a measure of the density of alternative conformations 
(DAC) having a given overlap or percentage of contacts 
in common with a fixed native structure. We demonstrate
with studies on chymotrypsin inhibitor (2ci2) and
barnase (1a2p) that the DAC provides key information
on the folding nucleus [20]. An analysis of the DAC for
real protein structures and for artificially generated decoy 
ones suggests that an extremal principle is operational in 
nature, which maximizes the DAC at intermediate overlap,
providing a large basin of attraction for
the native state and promoting the emergence of sec- 
ondary structures. 

Operationally, our study consists of the determination 
of the number of alternative structures which have a 
given structural similarity to a putative native state. The 
structural similarity between the native structure and an 
alternative one is defined as the percentage of common 
native contacts in the alternative conformation. It is well 
known that such a measure is a good coordinate characterizing
the folding process [11-13]. Following standard
practice, two residues are defined to be in contact if the
distance between their Cα atoms is less than 6.5 Å. In an
unbiased study, conformations that differ slightly should
not be considered distinct. To avoid this problem, we
perform a coarse-graining of the configurational degrees 
of freedom by adopting the discretization approach introduced
by Covell and Jernigan, where the Cα's
occupy sites on a suitably oriented FCC lattice (of edge
3.8 Å). This discretization does not distort the peptide
angles, and the positions of the coarse-grained Cα's differ
from the true ones by typically less than 1 Å RMSD.
For proteins of about 100 residues, the contact maps
of the real and FCC coarse-grained structures
are virtually identical.
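
The contact definition and the overlap measure described above can be illustrated with a minimal Python sketch. It builds a contact set from Cα coordinates and computes the percentage of native contacts retained by an alternative conformation. The function names and the `min_sep` parameter (which skips residues adjacent along the chain) are assumptions of this sketch and are not specified in the text.

```python
import math
from itertools import combinations

def contact_set(coords, cutoff=6.5, min_sep=2):
    """Pairs (i, j), i < j, whose C-alpha atoms lie closer than `cutoff` angstroms.

    `min_sep` is an assumption of this sketch: pairs closer than
    `min_sep` along the sequence are ignored.
    """
    return {
        (i, j)
        for i, j in combinations(range(len(coords)), 2)
        if j - i >= min_sep and math.dist(coords[i], coords[j]) < cutoff
    }

def overlap_percent(native, alternative):
    """Percentage of native contacts that are also present in `alternative`."""
    return 100.0 * len(native & alternative) / len(native)

# Toy four-residue chain (coordinates in angstroms, purely illustrative).
native = contact_set([(0, 0, 0), (3.8, 0, 0), (7.6, 0, 0), (3.8, 3.8, 0)])
partial = {(0, 3)}  # hypothetical alternative structure keeping one native contact
```

Here `overlap_percent(native, partial)` gives the structural similarity of the toy alternative structure to the toy native one.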

The generation of alternative conformations was car- 
ried out using a Monte Carlo procedure. A starting con- 
formation was successively modified by displacing the 
Cα's to unoccupied positions of the FCC lattice. The
move of an amino acid to an unoccupied site is allowed 
only if the new conformation satisfies certain constraints 
of steric overlap and peptide geometry. These constraints 
(any two non-consecutive residues cannot be closer than
4.65 Å due to excluded volume effects and the peptide
bond is not stretched beyond 5.37 Å) were determined
after carrying out an FCC coarse-graining of several proteins
of intermediate length (≈ 100 residues) and enforced
in the generation of alternative protein-like conformations.
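
A minimal sketch of how the two geometric constraints quoted above might be checked for a trial conformation is given below. The function name and the small tolerance `tol` are assumptions of this sketch; the tolerance is included only because the quoted limits are rounded values.

```python
import math

MIN_NONBONDED = 4.65  # angstroms, excluded-volume limit quoted in the text
MAX_BOND = 5.37       # angstroms, maximum bond stretch quoted in the text

def is_protein_like(coords, tol=0.01):
    """Return True if the chain obeys the steric and bond-length constraints.

    `tol` (angstroms) is an assumption of this sketch, absorbing the
    rounding of the quoted limits.
    """
    n = len(coords)
    # Consecutive residues: the bond must not be stretched too far.
    for i in range(n - 1):
        if math.dist(coords[i], coords[i + 1]) > MAX_BOND + tol:
            return False
    # Non-consecutive residues: excluded volume.
    for i in range(n):
        for j in range(i + 2, n):
            if math.dist(coords[i], coords[j]) < MIN_NONBONDED - tol:
                return False
    return True

straight = [(3.8 * k, 0.0, 0.0) for k in range(5)]   # extended chain
clash = [(0, 0, 0), (3.8, 0, 0), (0, 1, 0)]          # residues 0 and 2 overlap
stretched = [(0, 0, 0), (6.0, 0, 0)]                 # bond too long
```

In a Monte Carlo move, a check of this kind would be applied to the trial conformation before it is considered for acceptance.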

In order to minimize the effects of correlation between 
successively generated structures, we typically discarded 
50 elementary moves before accepting each new confor- 
mation. A newly generated conformation was accepted 
with the usual Metropolis rule according to the change in 
the Boltzmann weight: $e^{\Delta / k_B T}$, where $\Delta$ is the change
in contact overlap and $T$ is a fictitious temperature. By
choosing T appropriately, one can readily generate al- 
ternative conformations with a desired average contact 
overlap, q. At a given temperature, the true number 
of alternative structures with overlap q is proportional 
to the number of states with overlap q obtained in the 
simulation multiplied by the Boltzmann weight. On un- 
doing the Boltzmann bias, it is possible to recover the 
true density of states in a region around q. In order to
obtain the density of states for all values of overlap, we
performed Monte Carlo samplings at different temperatures
and then used standard deconvolution procedures.
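
The acceptance rule and the removal of the Boltzmann bias can be sketched as follows. This is a minimal illustration under stated assumptions: the overlap is measured in number of contacts, $k_B$ is set to 1, and the sampling weight is taken to favor higher overlap, so that the histogram collected at temperature $T$ is proportional to $n(q)\,e^{q/T}$. The function names are hypothetical, and the combination of several temperatures (the deconvolution step) is not shown.

```python
import math
import random

def metropolis_accept(delta_overlap, temperature, rng=random):
    """Accept a move with probability min(1, exp(delta_overlap / T)), k_B = 1."""
    if delta_overlap >= 0:
        return True
    return rng.random() < math.exp(delta_overlap / temperature)

def unbias_log_density(histogram, temperature):
    """Estimate log n(q), up to an additive constant, from a biased histogram.

    Assumes the histogram H(q) was sampled with weight exp(q / T), so that
    log n(q) = log H(q) - q / T + const.
    """
    return {
        q: math.log(count) - q / temperature
        for q, count in histogram.items()
        if count > 0
    }

# Toy check: bias a known density of states, then undo the bias.
T = 0.5
true_density = {0: 1000.0, 1: 100.0, 2: 10.0}
biased = {q: n * math.exp(q / T) for q, n in true_density.items()}
recovered = unbias_log_density(biased, T)
```

Because each temperature samples only a window of overlaps well, the estimate from a single run is reliable only around its average overlap, which is why runs at several temperatures are needed.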

We began with the backbones of chymotrypsin inhibitor
(2ci2) and barnase (1a2p) and generated alternative
structures with a moderate overlap [18] (≈ 40%)
for each of them. It turned out that the most frequent
contacts shared by the native conformation of 2ci2 with
the alternative ones involved the helical residues 30-42
(see Fig. 1, top), and the rarest ones pertained to
interactions between the helix and the β-strands and between
the β-strands themselves. This is in excellent agreement
with the studies of Fersht et al., which demonstrated
the formation of the helix at early stages of the
folding. A different behaviour (see Fig. 2, bottom) was
found for barnase, where, again for an overlap of ≈ 40%,
we find many contacts pertaining to the nearly complete
formation of helix 1 (residues 8-18), a partial formation
of helix 2, in particular bonds between residues 26-29
and 29-32, as well as several non-local contacts bridging
the β-strands, especially residues 51-55 and 72-75. This
picture is fully consistent with the experimental results 
obtained in ref. 

This provides a sound a posteriori justification that 
the main features of the folding of a protein can be fol- 
lowed from a study of the DAC. Remarkably, the method 
discussed above relies entirely on structure-related prop- 
erties and suggests that the features of the folding funnel 
are determined by the geometry of the "bare" backbone, 
while the finer details, of course, depend on the specific 
well-designed sequence. 

We now turn to an analysis of three proteins of length
51 (1hcg, 1hja and 1sgp) which have nearly the same
number of native contacts (≈ 83). For each structure,
we calculated the DAC with the constraint that the total
number of contacts in the alternative structures does
not exceed 88, to avoid excessive compactness. In order
to assess whether the DAC associated with naturally
occurring proteins had special features, we generated 
three decoy compact conformations of the same length 
and number of contacts, but with different degrees of 
short and long range contacts (in sequence separation). 
These decoys (subject to the aforementioned "physical
constraints") were generated with a simulated annealing
procedure to find the structure with the highest overlap
with a target contact matrix. By tuning the number of
short-range versus long-range entries in the random target
contact matrix, we generated three structures with
different degrees of compactness and local geometrical
regularity.

The plots of the DAC are shown in Fig. 3. A striking
feature of the curves is that, for intermediate overlap,
the DAC of the real proteins is enormously larger than
that of the decoys (note the logarithmic scale), which suggests
that naturally occurring conformations have a much
larger number of entryway structures than random compact
conformations. Furthermore, for very high values of
the overlap, the steepness of the protein curves is much
larger than that of the decoys, showing that the reduction
in the conformational entropy is also correspondingly
higher. This translates into the existence of a funnel with 
a very large basin and steep walls. Another significant 
feature is the good collapse of the protein curves. We 
have verified that this feature also holds for 1bdo and
2pk4, which each have 80 residues and 140 and 146 contacts,
respectively. A simple explanation for the curve
collapse could be that the density of states for real proteins
is "extremal", in that it is close to the maximum
possible value for intermediate values of the overlap. 

The importance of the locality of contacts for folding 
kinetics was highlighted recently by Plaxco et al.,
who found a correlation between folding rate and con- 
tact order, defined as the average sequence separation of 
contacts normalized to the total number of contacts and 
sequence length. With reference to Fig. 3, the contact
order values for proteins 1hcg, 1hja and 1sgp are 0.139,
0.214 and 0.204, respectively. For the decoy structures, it
is 0.424, 0.222 and 0.179 for the curves denoted by open 
squares, pentagons and hexagons, respectively. The low- 
est curve in the figure is indeed associated with an un- 
usually high contact order in accord with the findings of 
Plaxco et al.
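
The contact order as defined above (average sequence separation of contacts, normalized to sequence length) can be written as a short function. This is a minimal sketch; the function name and the toy contact list are hypothetical and do not correspond to any of the structures discussed here.

```python
def relative_contact_order(contacts, length):
    """Sum of |i - j| over all contacts, divided by (number of contacts * length)."""
    total_separation = sum(abs(i - j) for i, j in contacts)
    return total_separation / (len(contacts) * length)

# Toy example: two contacts with separations 4 and 2 in a chain of 10 residues.
example = relative_contact_order([(0, 4), (1, 3)], length=10)
```

A structure dominated by local contacts (small |i - j|) gives a low value, whereas one dominated by contacts between residues far apart along the chain gives a high value.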

A ubiquitous feature of protein structures is the existence
of secondary structure motifs [24,25]. We have
carried out some simple investigations to assess whether 
a correlation exists between the extremality of the DAC 
curve and the emergence of secondary-structure-like mo- 
tifs. 

We considered a space of contact maps within
which each of the residues interacted with the same number
of other residues, $n_c$ (typically $n_c = 5$, as in the average
case of a protein with about 100 residues and a cutoff
distance of 6.5 Å). This space contains both maps corresponding
to real structures and unphysical ones. Furthermore,
to mimic the effects of the rigidity and geometry
of the peptide bond, we disallowed contacts between
residue $i$ and the four neighboring residues along the
sequence, $i-2$, $i-1$, $i+1$ and $i+2$.

In this context, the maximization of the density of 
states corresponds to finding the target matrix with the 
highest number of matrices sharing a given fraction of its 
contacts. Although it is difficult to solve this problem
for arbitrary values of the overlap, it is relatively easy
to generate matrices with an overlap close to the maximum
value, $q_{max}$ (for an $L \times L$ matrix, $q_{max} = L \cdot n_c$). To
enumerate all matrices with overlap $q_{max} - 2$, one first
identifies a pair of non-zero entries in the target matrix
$m$: $m_{ij} = m_{kl} = 1$. Then it is necessary to check whether
the entries $m_{il}$, $m_{kj}$ are both "free" (i.e. equal to zero) and
do not correspond to forbidden contacts (e.g. between
$i$ and $i+1$). If this is so, the old pair of entries (and
their symmetric counterparts) are set to zero, and the new
ones to 1. By considering, in turn, all possible pairs of
non-zero entries one can generate all matrices of overlap
$q_{max} - 2$. Then, by performing a simulated annealing
in contact-map space one can isolate the map having the
highest number of matrices with overlap $q_{max} - 2$.
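
The enumeration of maps reachable from a target map by a single swap of contact partners can be sketched as follows. This is a minimal illustration, not the implementation used in this work: contacts are stored as a set of index pairs rather than as a symmetric matrix, both ways of re-pairing the four residues are tried (which corresponds to running over both symmetric entries of the matrix), and the function name and the `min_sep` parameter are assumptions of the sketch.

```python
from itertools import combinations

def single_swap_neighbors(contacts, min_sep=3):
    """Distinct contact maps obtained from `contacts` by one partner swap.

    A swap replaces contacts (i, j) and (k, l) by (i, l) and (k, j), or by
    (i, k) and (j, l), provided the new contacts are free and not forbidden
    (sequence separation below `min_sep`). Every residue keeps its number
    of contacts.
    """
    contacts = {tuple(sorted(c)) for c in contacts}
    neighbors = set()
    for (i, j), (k, l) in combinations(sorted(contacts), 2):
        for first, second in (((i, l), (k, j)), ((i, k), (j, l))):
            new_pair = (tuple(sorted(first)), tuple(sorted(second)))
            allowed = all(
                abs(p - q) >= min_sep and (p, q) not in contacts
                for p, q in new_pair
            )
            if allowed:
                remaining = contacts - {(i, j), (k, l)}
                neighbors.add(frozenset(remaining | set(new_pair)))
    return neighbors

# Toy target map with two contacts on a chain of 8 residues.
toy_neighbors = single_swap_neighbors({(0, 3), (4, 7)})
```

The size of the returned set is the quantity that a simulated annealing in contact-map space would seek to maximize over target maps.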

We carried out our calculations for values of L around 
60. The optimal matrices appear to have features reminiscent
of α-helices and β-sheets, as shown in Fig. 4.
A more quantitative measurement of the secondary-structure
content of the optimal matrices can be obtained
by considering the correlation functions

$$g_1(x) = \sum_i m_{i,i+x}\,; \qquad g_2(x) = \sum_i m_{i,x-i} \qquad (1)$$

which show peaks in correspondence with the sequence
separation of residues involved in α-helices and parallel
β-sheets ($g_1$) or antiparallel β-sheets ($g_2$).

Typical plots of the correlation functions for an optimal
map of length 60 and for the protein 3ebx (length
62) are shown in Fig. 5. The similarity of the plots is
striking, particularly because, in both cases, the height
of the peaks in $g_1$ decreases with sequence separation,
unlike the situation with $g_2$.
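
The two correlation functions of Eq. (1) can be evaluated directly from a symmetric 0/1 contact matrix, as in the minimal sketch below. The helper `contact_matrix` and the two toy maps (an idealized helix-like band and an idealized antiparallel hairpin) are hypothetical illustrations, not maps obtained in this work.

```python
def g1(m, x):
    """Sum of m[i][i + x] over all i for which i + x is a valid index."""
    n = len(m)
    return sum(m[i][i + x] for i in range(n) if 0 <= i + x < n)

def g2(m, x):
    """Sum of m[i][x - i] over all i for which x - i is a valid index."""
    n = len(m)
    return sum(m[i][x - i] for i in range(n) if 0 <= x - i < n)

def contact_matrix(n, pairs):
    """Symmetric n x n matrix with ones at the given contact pairs."""
    m = [[0] * n for _ in range(n)]
    for i, j in pairs:
        m[i][j] = m[j][i] = 1
    return m

# Helix-like band: contacts between residues i and i + 3.
helix = contact_matrix(10, [(i, i + 3) for i in range(7)])
# Antiparallel hairpin: contacts between residues with i + j = 9.
hairpin = contact_matrix(10, [(0, 9), (1, 8), (2, 7), (3, 6)])
```

For the helix-like map, `g1` peaks at the separation 3, whereas for the hairpin the signal is concentrated in `g2` at x = 9, the constant value of i + j along the antiparallel strands.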

In summary, novel numerical techniques are used to 
elucidate the paramount role played by the geometry of 
the protein backbone in providing a large basin of attrac- 
tion to the native state. It is found that, irrespective of 
the sequence, the native state of a protein has an exceed- 
ingly large number of conformations with a given amount 
of structural overlap compared to other compact artificial 
backbones. Strikingly, by studying the conformational 
entropy of a backbone it is possible to identify the fold- 
ing nucleus with no input of the actual protein sequence. 
Moreover, the conformational entropies of unrelated proteins
of the same length are nearly equal at any significant
value of the reaction coordinate. These results
are suggestive of an extremality principle underlying the 
selection of naturally occurring folds of proteins which, 
in turn, is shown to be associated with the emergence of 
secondary structures. Our procedure ought to be useful 
for the generation of alternative conformations necessary 
for protein design and the determination of the effective 
interactions between amino acids. 

FIG. 1. Ribbon plot (obtained with RASMOL) of 2ci2
(top) and barnase (bottom). The residues involved in the
12 [16] most frequent contacts of alternative structures with
overlap ≈ 40% with the native conformations are highlighted
in black. The majority of these coincide with contacts that 
are formed at the early stages of folding. 



[Plot residue: panel "2CI2"; horizontal axis: Frequency]

[Plot residue: horizontal axis: overlap (%)]

FIG. 3. Density of states for the proteins 1sgp (filled
squares), 1hja (filled pentagons) and 1hcg (filled hexagons).
Curves for artificial decoy structures are denoted by the open 
symbols. 

FIG. 4. The upper [lower] triangle shows a target contact
matrix with L = 60 that has a large [intermediate] number of
contact maps with an overlap of $q_{max} - 2$ contacts.

FIG. 2. Distribution of sequence separation of contacts
common in alternative conformations for 2ci2 and 1a2p. The
most frequent contacts in 2ci2 have a small sequence separation
(3-4) and pertain to helix formation. 1a2p shows a
very different behaviour, with several contacts with very large
sequence separation.

FIG. 5. Correlation functions (see equation 1) for an optimal
target matrix of length 60 and for protein 3ebx.